Background of the Problem

The target attribute for classification is Category: blood donors vs. Hepatitis C, including its progression ('just' Hepatitis C, Fibrosis, Cirrhosis).

Background of the Disease

reference_link

Who is at risk for hepatitis C infection?

The following people are at increased risk for hepatitis C:

Who is more likely to develop cirrhosis after becoming infected with HCV?

Rates of progression to cirrhosis are increased in the presence of a variety of factors, including

What are the signs and symptoms of chronic HCV infection?

Most people with chronic HCV infection are asymptomatic or have non-specific symptoms such as chronic fatigue and depression. Many eventually develop chronic liver disease, which can range from mild to severe, including cirrhosis and liver cancer. Chronic liver disease in HCV-infected people is usually insidious, progressing slowly without any signs or symptoms for several decades. In fact, HCV infection is often not recognized until asymptomatic people are identified as HCV-positive when screened for blood donation or when elevated alanine aminotransferase (ALT, a liver enzyme) levels are detected during routine examinations.

Tests

There are several laboratory tests that may be performed in cases of known or suspected hepatitis. These tests may fall into one or more of the following categories:

Assumption for Variables

Setup

Load Libraries

Settings

Global Function

Data

Preview

There are 615 rows and 13 columns. We can observe that the Category and Sex columns are categorical and all other columns are numeric, as described in the dataset information. However, we also noticed some missing values in the ALP column.

According to the question, we want to classify Hepatitis versus Blood Donor. Hence, we will create another column that recodes the category. In addition, we will exclude suspect blood donors for now.

Remove Suspect Blood Donor

General Category

Now we have 533 Blood Donors and 75 Hepatitis patients at different stages.
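The recoding described in this section could be sketched as follows. This is a minimal illustration, not the notebook's own code: the label strings and the helper name `to_general_category` are assumptions, and the exact labels in the dataset may differ.

```python
from collections import Counter

# Hypothetical raw labels; the exact strings in the dataset may differ.
SUSPECT = "suspect Blood Donor"
HEPATITIS_STAGES = {"Hepatitis", "Fibrosis", "Cirrhosis"}


def to_general_category(categories):
    """Drop suspect donors and collapse all disease stages into 'Hepatitis'."""
    general = []
    for label in categories:
        if label == SUSPECT:
            continue  # excluded from the analysis for now
        general.append("Hepatitis" if label in HEPATITIS_STAGES else "Blood Donor")
    return general


sample = ["Blood Donor", "suspect Blood Donor", "Hepatitis",
          "Fibrosis", "Cirrhosis", "Blood Donor"]
general = to_general_category(sample)
counts = Counter(general)
print(counts)
```

Collapsing the stages keeps the problem binary, at the cost of hiding differences between Hepatitis, Fibrosis, and Cirrhosis.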

Handle Missing Value

There is a slight change. However, it does not seem to impact the dataset, so we can ignore it for now.

Numeric Columns

Age is included among the numeric columns. However, it might not be important.

Selected Columns

Summary Statistics & Boxplot

Summary Statistics on Numeric Columns

From the table, we can observe that ALB and PROT look like percentages, as they do not exceed 100, which matches our variable assumption from the start.
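As a rough sketch of how such a summary can be produced, the helper below computes the five numbers a boxplot is built from and lets us check whether a column stays below 100. The function name and the example values are hypothetical, and the quartile method (`inclusive`) is an assumption; the notebook's own summary function may interpolate differently.

```python
import statistics


def five_number_summary(values):
    """Return min, lower quartile, median, upper quartile, and max, ignoring None."""
    data = sorted(v for v in values if v is not None)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": q2, "q3": q3, "max": data[-1]}


# Hypothetical values with one missing entry, not taken from the dataset.
example = [38.5, 43.2, None, 46.9, 41.6, 39.2]
summary = five_number_summary(example)
print(summary)
print("stays below 100:", summary["max"] <= 100)
```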

Boxplot

From the summary statistics, we observed that:

Boxplot - by Category

From the box plot (note that whenever the term 'relatively' is used, the comparison is always between the sick categories and the non-sick category):

All the points above are observations and assumptions made to ease our task later on. They are not definite conclusions until we further analyse the dataset.

Boxplot - by General Category

As expected, the trend is easier to observe here by comparison. Hence we will proceed with the general category instead of the Category column.

From the boxplot (note that Hepatitis here includes Cirrhosis and Fibrosis):

Boxplot - by Sex

In general, most of the data seems to be scattered in the same range and does not differ much between the sexes. Hence we will not use gender in this analysis.

The summary above is more or less similar to the boxplot analysis. However, it also lets us observe the skewness.
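For reference, skewness per group can be computed as in the sketch below. It uses the plain moment-based definition without a small-sample correction, which is an assumption; the notebook's function may apply a different correction. The group values are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical measurements for one variable, split by general category.
groups = {"Blood Donor": [1, 2, 3, 4, 5], "Hepatitis": [1, 1, 1, 10]}
skews = {name: skewness(vals) for name, vals in groups.items()}

# The two comparisons used in this section: a gap above 1.0, or opposite signs.
large_gap = abs(skews["Blood Donor"] - skews["Hepatitis"]) > 1.0
opposite_direction = skews["Blood Donor"] * skews["Hepatitis"] < 0
print(skews, large_gap, opposite_direction)
```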

Large difference in skewness (more than 1.0) between groups:

Different direction of skewness between groups:

From here, we make assumptions about the patients using the minimum, maximum, and median:

Correlation Plot

Looking at the correlation matrix, the variables with higher correlation with each other are:

Also, PROT seems to be correlated with ALB, since one is a ratio of the other, as mentioned in the data preview.
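Each cell of a correlation matrix is a pairwise coefficient. A minimal sketch of the Pearson coefficient is shown below, assuming that is the measure used here (the notebook does not say); the column values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical columns, not taken from the dataset.
columns = {"ALB": [38.0, 41.0, 44.0, 47.0], "PROT": [69.0, 72.0, 74.0, 78.0]}
r = pearson(columns["ALB"], columns["PROT"])
print(round(r, 3))
```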

Conclusion

Based on the analysis above, we might be interested in looking into different combinations of variables instead of plotting all the variables. The combinations of variables will be:

Scatterplot

Because each column has a different range of values, we use scaled values to get a better look.
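One common way to scale is z-score standardization, sketched below. Whether the notebook used this or another method (for example min-max scaling) is not stated, so treat the choice as an assumption; the values are hypothetical.

```python
import statistics


def standardize(values):
    """Centre on the mean and divide by the sample standard deviation (n - 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Two hypothetical columns on very different scales.
crea = [60.0, 80.0, 100.0]
bil = [5.0, 10.0, 15.0]
print(standardize(crea))
print(standardize(bil))
```

After scaling, both columns are expressed in standard deviations from their own mean, so they can share the same axes in a scatterplot.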

Cols

Assumptions: ALP & AST seem to be a good pair for separating the two groups.

In these highly correlated columns, the samples of both groups overlap with each other.

PCA

PCA on selected columns

In general, the groups seem to be mixed together on these selected columns.

PCA on Cols.1

Based on the scatter plot, it does not seem to separate the Hepatitis patients from the blood donors well.

PCA on Cols.2

Cols.2 does not seem to work.

PCA on Cols.3

The Cols.3 result seems better, but there is a lot of noise within the clusters.

PCA on Cols.4

Conclusion

Cols.4 seems to give a better separation than the others on component 1 and component 4. The variables included in Cols.4 are ALP, ALT, AST, GGT, and CREA.

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
ALP   0.389  0.553  0.339  0.497  0.425
ALT   0.293 -0.486 -0.513  0.644
AST   0.537 -0.382        -0.506  0.553
GGT   0.666         0.183        -0.714
CREA  0.173  0.557 -0.765 -0.270
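To show how loadings are used, a component score is the loading-weighted sum of the scaled variables. The sketch below takes the first value listed for each variable, assuming those belong to Comp.1; the function name and the scaled sample are hypothetical.

```python
import math

# First loading listed per variable, assumed here to be the Comp.1 column.
COMP1 = {"ALP": 0.389, "ALT": 0.293, "AST": 0.537, "GGT": 0.666, "CREA": 0.173}


def component_score(scaled_sample, loadings):
    """Project one already-scaled sample onto a single principal component."""
    return sum(loadings[name] * scaled_sample[name] for name in loadings)


# A loading vector should have length close to 1 (up to rounding).
norm = math.sqrt(sum(w * w for w in COMP1.values()))

# Hypothetical sample, one standard deviation above the mean on every variable.
sample = {"ALP": 1.0, "ALT": 1.0, "AST": 1.0, "GGT": 1.0, "CREA": 1.0}
score = component_score(sample, COMP1)
print(round(norm, 3), round(score, 3))
```

Because every Comp.1 loading is positive, a sample that is high on all five variables gets a high Comp.1 score, with GGT and AST contributing the most.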

Feature Selection

Recursive Feature Elimination

The sample code is from Machine Learning Mastery.

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature, repeating until the specified number of features is reached. Features are ranked by feature importance or coefficients. RFE attempts to eliminate dependencies and collinearity that may exist in the model.
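The loop that RFE performs can be sketched in plain Python. This is not the Machine Learning Mastery code used in the notebook: the model (logistic regression fitted by gradient descent), the ranking rule (smallest absolute coefficient on standardized features), the function names, and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import math


def standardize_columns(rows):
    """Scale every column to zero mean and unit variance."""
    scaled = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled)]


def fit_logistic(rows, labels, steps=300, lr=0.5):
    """Fit logistic regression by plain gradient descent; return the weights."""
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, y in zip(rows, labels):
            z = bias + sum(w * x for w, x in zip(weights, row))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights


def rfe(rows, labels, names, n_keep):
    """Refit, drop the feature with the smallest |coefficient|, and repeat."""
    names = list(names)
    rows = standardize_columns(rows)
    eliminated = []
    while len(names) > n_keep:
        weights = fit_logistic(rows, labels)
        weakest = min(range(len(names)), key=lambda j: abs(weights[j]))
        eliminated.append(names.pop(weakest))
        rows = [[x for j, x in enumerate(row) if j != weakest] for row in rows]
    return names, eliminated


# Synthetic data: 'strong' tracks the label, 'weak' only slightly, 'noise' not at all.
labels = [i // 20 for i in range(40)]
rows = [[3 * y + 0.1 * (i % 5), 0.2 * y + (i // 2) % 2, i % 2]
        for i, y in enumerate(labels)]
kept, eliminated = rfe(rows, labels, ["strong", "weak", "noise"], n_keep=1)
print(kept, eliminated)
```

The model is refitted after every removal, which is what distinguishes RFE from ranking all features once and keeping the top ones.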

Conclusion

The top 5 variables are AST, ALP, ALT, GGT, and CHE. Four of them (AST, ALP, ALT, GGT) match the columns used in pca.cols.4 (AST, ALP, ALT, GGT, CREA).

However, the accuracy graph shows that accuracy is highest with the top 11 features (excluding ALB), and that it improves little when using 5 to 10 features. Hence, if we want to predict Hepatitis patients with fewer features, the top 3 features are the best choice at the moment.

Chi Squared Test

Even though the p-value is 0.08, which is above the alpha of 0.05 that we set ourselves, the result does not seem convincing, as there are more male patients in general.
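For a 2x2 table such as Sex by general category, the chi-squared statistic and its p-value can be computed by hand, as sketched below. The counts are made up, and no continuity correction is applied, so the numbers will not necessarily match a library call that applies one by default.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test for a 2x2 table, without continuity correction."""
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows are the two sexes, columns are the two categories.
stat, p_value = chi_squared_2x2([[20, 10], [10, 20]])
print(round(stat, 3), round(p_value, 4))
```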

F Test

Based on the F test result, the top 5 most significant variables are CREA, BIL, AST, GGT, and ALT.
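A ranking like this can be produced with a one-way ANOVA F statistic per variable, assuming that is the F test meant here. The sketch below uses hypothetical feature names and values, and ranks by the statistic only, without computing p-values.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical features, each split into [blood donor values, hepatitis values].
features = {
    "feature_a": [[1, 2, 3], [4, 5, 6]],
    "feature_b": [[1, 2, 3], [2, 3, 4]],
}
scores = {name: f_statistic(groups) for name, groups in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A larger F means the group means differ more relative to the spread inside each group, so `feature_a` ranks first here.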

3D Scatter Plot

From the plot, we can observe that AST above a certain threshold (50), and extremely low or high levels of ALT, can differentiate the Hepatitis patients.
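This observation can be written as a simple screening rule. Only the AST threshold of 50 comes from the plot; the ALT bounds, the function name, and the choice to flag a sample when either condition holds are assumptions for illustration, not validated cut-offs.

```python
AST_THRESHOLD = 50           # read off the 3D scatter plot
ALT_LOW, ALT_HIGH = 10, 100  # hypothetical bounds for 'extremely low / high' ALT


def looks_like_hepatitis(ast, alt):
    """Flag a sample if AST is above the threshold or ALT is extreme."""
    return ast > AST_THRESHOLD or alt < ALT_LOW or alt > ALT_HIGH


# Hypothetical (AST, ALT) pairs.
samples = [(60, 30), (25, 30), (25, 5)]
flags = [looks_like_hepatitis(ast, alt) for ast, alt in samples]
print(flags)
```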